Journal of Bioinformatics and Systems Biology

Fortune Journals

Preprints posted in the last 90 days, ranked by how well they match Journal of Bioinformatics and Systems Biology's content profile, based on 14 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.

1
Prediction and analysis of new HisKA-like domains

Silly, L.; Perriere, G.; Ortet, P.

2026-03-02 bioinformatics 10.64898/2026.02.27.708494 medRxiv
Top 0.1%
3.5%

Histidine kinases (HKs) take part in many signaling pathways as components of two-component systems (TCS). Using autophosphorylation and phosphotransfer to a response regulator (RR), they enable organisms to adapt to their environment. Most HKs are transmembrane proteins with a sensing domain outside the cell and two catalytic domains called HisKA and HATPase. HATPase is required for interaction with ATP, and HisKA contains the phosphorylated histidine residue. HKs are involved in various environmental adaptation mechanisms, such as sensing light or biochemical changes. Studying their diversity is therefore important to better understand how cells interact with their environment. There exist incomplete HKs (iHKs) lacking either the HisKA or the HATPase domain. Some iHKs with an HATPase domain possess a section of their sequence where a HisKA domain could be expected. These iHKs may include "true" HKs, with an unknown HisKA domain, that could fill gaps in various signaling pathways. In this study we analyzed 869 964 sequences of iHKs having an HATPase domain but lacking a HisKA domain. We identified 18 HisKA-like profiles and performed multiple meta-studies to assess their HisKA-like characteristics. We found that their 3D structures matched the structure of known HisKA domains. We saw that the genomic context of the genes associated with these profiles contained genes implicated in signal transduction pathways. We cross-validated some of our profiles with curated annotations, as well as with a "negative dataset" made of non-HK proteins. We believe that our work could help improve the annotation of regulation pathways in prokaryotes.

2
Genes near tRNAs are enriched in translational machinery

West, C.; Dineen, L.; LaBella, A. L.

2026-03-16 bioinformatics 10.64898/2026.03.12.711363 medRxiv
Top 0.1%
3.0%

Transfer RNAs (tRNAs) are known for delivering amino acids to the growing polypeptide chain during translation. They can also influence gene expression, especially in times of nutrient starvation, through differential tRNA expression and modification. tRNAs have a highly consistent cloverleaf structure, but relatively few known regulatory elements govern this conserved structure despite the 20 different standard isotypes. This study examines gene enrichment patterns near tRNAs in 1154 fungal genomes. Genes enriched in proteasome regulation, ion transport, and rRNA were found to be significantly closer to tRNAs than other pathways. These results were consistent across KEGG over-representation analysis (ORA), KEGG Gene Set Enrichment Analysis (GSEA), and Gene Ontology (GO) analysis. Proteasome, ion transport, and RNA are all important aspects of protein production and regulation, suggesting that genes required for the synthesis and quality control of proteins, including tRNAs, are located near each other. Protein regulation is an energetically expensive process, and local co-regulation could increase efficiency and stress impacts on proteins.

3
Evaluating the reliability of tools for mRNA annotation and IRES studies

May, G. E.; Akirtava, C.; McManus, J.

2026-03-31 genomics 10.64898/2026.03.29.707813 medRxiv
Top 0.1%
2.8%

Since the discovery of viral Internal Ribosome Entry Sites (IRESes), researchers have sought to find similar elements in mammalian host genes, termed "cellular IRESes". However, the plasmid systems used to measure cellular IRES activity are vulnerable to false positives due to promoter activity in candidate IRESes. Orthogonal methods are needed to validate putative IRESes while carefully avoiding artifacts known to cause false positives. Recently, Koch et al. proposed approaches for studying IRESes, primarily circular RNA-generating plasmids, and for validating mRNA transcripts using smFISH and qRT-PCR. Here, we demonstrate confounding variables and artifacts in each of these approaches that can lead to inappropriate conclusions about potential cellular IRES activity. We show the back-splicing circRNA plasmid creates linear mRNA artifacts associated with false-positive IRES signals. Using orthogonal, gold-standard assays validated with viral IRESes, we find putative cellular IRESes reported using the back-splicing plasmid have no IRES activity. Furthermore, we demonstrate that smFISH and qRT-PCR can misidentify nuclear non-coding RNAs as mRNAs and we validate a single molecule sequencing assay for identifying genuine mRNA 5′ ends. Our work establishes reliable methods for robust transcript annotation and IRES studies that avoid documented artifacts arising from bicistronic and back-splicing circRNA plasmid reporters.

4
TOXsiRNA: A web server to predict the toxicity of chemically modified siRNAs

Dar, S.; Kumar, M.

2026-02-14 bioinformatics 10.64898/2026.02.12.705521 medRxiv
Top 0.1%
2.4%

Small interfering RNAs (siRNAs) are often chemically modified to enhance their properties for use in molecular biology research and therapeutic applications. Toxicity may arise from these chemical moieties as well as from sequence-based off-targets at the cellular level. Enormous resources are required to experimentally design and test the toxicity of these chemical modifications and their combinations on siRNAs. To address this problem, we developed the TOXsiRNA web server to computationally predict the toxicity of chemically modified siRNAs and their off-targets. We selected 2749 siRNAs carrying different permutations and combinations of 21 different chemical modifications. Next, we used Support Vector Machine (SVM), Linear Regression (LR), K-Nearest Neighbor (KNN) and Artificial Neural Network (ANN) machine learning methods to develop models. The best performance was displayed by a mononucleotide composition-based model developed with SVM, offering a Pearson Correlation Coefficient (PCC) of 0.91 and 0.92 on training/testing and independent validation, respectively. Other sequence features, such as dinucleotide composition, binary pattern, and their combinations, were also tested. Finally, three models of chemically modified siRNAs were implemented on the web server. Other algorithms, including prediction of normal as well as chemically modified siRNA knockdown efficacy, off-targets, etc., are also integrated. The resource is freely hosted online for scientific use at http://bioinfo.imtech.res.in/manojk/toxsirna.
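
To make the "mononucleotide composition" feature concrete, here is a minimal sketch of how such a feature vector could be computed for an unmodified sequence. The function name and the plain A/C/G/U alphabet are assumptions for illustration; the abstract does not say how TOXsiRNA encodes the 21 chemical modifications, and this sketch ignores them.

```python
from collections import Counter

# Assumption: a plain RNA alphabet. The real models must also encode
# chemical modifications, which this illustrative sketch does not do.
ALPHABET = "ACGU"


def mononucleotide_composition(seq: str) -> list[float]:
    """Return the fraction of each nucleotide, in ALPHABET order."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return [counts[base] / len(seq) for base in ALPHABET]


if __name__ == "__main__":
    # Hypothetical 6-nt toy sequence, not taken from the dataset.
    print(mononucleotide_composition("AACGUU"))
```

A vector like this (four numbers per siRNA) is the kind of input a regression model such as an SVM could be trained on.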

5
Computational insights into the interaction between Topoisomerase I and Rpc82 subunit of RNA Polymerase III in Saccharomyces cerevisiae

Nandi, P.; Kamal, I. M.; Chakrabarti, S.; Sengupta, S.

2026-02-03 bioinformatics 10.64898/2026.01.31.703072 medRxiv
Top 0.2%
1.7%

The process of DNA transcription leads to the generation of torsional stress, which must be resolved for smooth progression of the transcription machinery. In Saccharomyces cerevisiae, DNA topoisomerase I (Top1), a type IB topoisomerase, plays a critical role in relaxing supercoils and mitigating the topological strain associated with transcription. While several proteins from the transcription machinery have been reported to interact with yeast Top1, detailed characterization and functional relevance of these interactions have remained underexplored. This gap is partly due to the absence of a complete three-dimensional structure of the full-length enzyme, which hinders structure-based computational analyses of its interactome. In this study, we present a template-based model of full-length yeast Top1. Leveraging this model, we investigated its molecular interaction with Rpc82, a key subunit of the RNA polymerase III enzyme, which is responsible for transcribing small non-coding RNAs such as tRNAs and 5S rRNA. Through molecular docking and molecular dynamics simulations, critical residues at the Top1-Rpc82 interface were identified that likely mediate their interaction. Our findings provide new insights into the structural basis of Top1's association with RNA polymerase III and its potential role in regulating Pol III-mediated transcription. The Top1 model developed here offers a valuable framework for future in silico studies aimed at elucidating the broader interactome and regulatory mechanisms of this essential enzyme.

6
A functional annotation based integration of different similarity measures for gene expressions

Misra, S.; Roy, S.; Ray, S. S.

2026-02-24 bioinformatics 10.64898/2026.02.23.707392 medRxiv
Top 0.2%
1.5%

Genes with similar expression profiles often exhibit similar functional properties. An "integrated similarity score" (ISS) is developed by combining different expression similarity measures through weights, obtained using biological information, for improving gene similarity. The expression similarity measures are converted to the common framework of positive predictive value using functional annotation. A fitness function, called "fitness function using functional annotation of genes" (FFFAG), is also developed by minimizing the difference between the functional similarity value and the ISS. The FFFAG is used to determine the weight combination of different similarity measures in the ISS. In addition, an existing similarity measure, called TMJ (integrated similarity measure by multiplying Triangle and Jaccard similarity), is also modified to incorporate biological knowledge involving functional annotation. The results demonstrate that the ISS is superior to individual similarity measures for finding similar gene pairs. Further, the ISS predicts the functional categories of 40 unclassified yeast genes at a p-value cutoff of 10^-10 from 12 clusters. The associated code is accessible at http://www.isical.ac.in/~shubhra/ISS.html.

7
Bias in genome-wide association test statistics due to omitted interactions

Yelmen, B.; Güler, M. N.; Estonian Biobank Research Team; Kollo, T.; Möls, M.; Charpiat, G.; Jay, F.

2026-02-22 bioinformatics 10.1101/2025.11.21.689603 medRxiv
Top 0.2%
1.3%

Over the past two decades, genome-wide association studies (GWAS) have enabled the discovery of thousands of variants associated with many complex human traits. However, conventional GWAS are still widely performed with linear models under the assumption that the genetic effects are predominantly additive. In this work, we investigate the behavior of the test statistic when linear models are used to obtain significant genotype-phenotype associations without accounting for epistasis. We first algebraically derive the mean and variance shifts in the null statistic due to the omitted interaction term, and define the boundary between conservative (i.e., deflated statistic tail) and anti-conservative (i.e., inflated statistic tail) regimes for the common GWAS significance threshold. We then perform phenotype simulation analyses using the Estonian Biobank genotypes and validate the mathematical model. We demonstrate that the anti-conservative regime is plausible under realistic parameter settings and that models omitting interaction terms can produce spurious significance. Our findings suggest caution when interpreting statistically significant signals reported in the literature based on linear models, especially for large-scale GWAS.

8
A Course-Undergraduate Research Experience (CURE) to explore the effect of structural variants on gene expression in C. elegans balancers

Maroilley, T.; Barbosa, V. R. A.; Mascarenhas, R.; Ferris, S.; Diao, C.; AlAwadhi, F.; Aldakheel, S.; Ali, A.; Alkanderi, D.; Alshatti, M.; Alsuwaileh, S.; Asghar, K.; Bui, R.; Chai, B.; Dsouza, L.; Nezhad, P. E.; Garcia-Volk, E.; Haq, Z.; Hossain, S.; Johnson, G.; Kotikalapudi, N.; Lalani, I.; Lenz, C.; Louie, T.; Moore, S.; Patel, S.; Prasai, S.; Qureshi, R.; Rahmani, F.; Shakir, B.; Ahamed, S. S.; Tran, H. A.; Waziha, R.; Wood, C. M.; Zbinden, S.; Anderson, D.; Tarailo-Graovac, M.

2026-01-23 bioinformatics 10.64898/2026.01.21.700799 medRxiv
Top 0.3%
1.3%

Bioinformatics, a discipline at the crossroads of Biology and Computational Sciences, also referred to as Computational Biology, is nowadays widespread in research programs. However, implementing any Bioinformatics project requires the ability to comprehend biological concepts and apply computational approaches, and undergraduate programs offering such multi-disciplinary training are rare. In addition, understanding the dynamic between Biology research projects and Bioinformatics analyses is challenging without real-life experience. Course-based undergraduate research experience (CURE) courses are innovative programs that allow more students to acquire research experience and provide the perfect setting to introduce students to applied bioinformatics. As a part of the Bachelor of Health Sciences of the Cumming School of Medicine at the University of Calgary (Canada), a CURE in applied bioinformatics was implemented in the Winter terms of 2023 to 2025. Students investigated the effect of structural variants (SVs, genetic variants larger than 50 bp) on gene expression in the model organism Caenorhabditis elegans (a hermaphrodite 1-mm long roundworm). The students detected and characterized SVs by analyzing genome and transcriptome sequencing data of C. elegans strains called balancers, which are known to carry large genomic variations that balance regions of the genome by limiting recombination and allowing maintenance of lethal mutations. They used Galaxy, a public web-based supercomputing resource, as well as a local High-Performance Computing system and R, to report different effects of SVs on gene expression and splicing. The students' research explained the molecular mechanism behind the uncoordinated phenotype caused by the reciprocal translocation eT1(III;V) and uncovered unexpected effects on the expression of an understudied gene.
We evaluated the course's impact on student learning journeys and showed that the CURE favored students' understanding of the Bioinformatics field and fostered their research interest. We provide here guidelines to facilitate CURE implementations and improve access for undergraduate students to bioinformatics research experiences.

9
On the Accuracy of Internal Circadian Time Prediction Methods from a Single Sample

Gorczyca, M. T.

2026-02-13 bioinformatics 10.64898/2026.02.11.705208 medRxiv
Top 0.3%
1.2%

Biological processes ranging from gene expression to sleep-wake cycles display oscillations with an approximately 24-hour period, or circadian rhythms. A challenge in analyzing circadian rhythms is that these rhythms vary across individuals and are based on an individual's internal circadian time (ICT), which is uniquely offset relative to the 24-hour day-night cycle time (zeitgeber time, or ZT). Many model-based methods have been proposed to predict ICT given an individual's biomarker measurements. However, the prediction accuracy of these methods is rarely validated using known ICT. In this article, we evaluate this accuracy for three state-of-the-art model-based methods: COFE, partial least squares regression, and TimeSignature. We find that if a single sample is obtained from each individual and a model is fit using only biomarker measurements as predictors, then ZT predicts ICT more accurately than any of the model-based ICT predictions. However, we also find that TimeSignature can outperform ZT when the model incorporates sine and cosine transforms of sample collection ZT as two additional predictors. These findings are based on analysis of three circadian transcriptome datasets as well as simulation studies, and highlight the importance of accounting for individual-level differences in biomarker oscillations to improve ICT prediction.
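
The "sine and cosine transforms of sample collection ZT" mentioned above can be illustrated with a short sketch. It assumes a 24-hour period and ZT expressed in hours; the function names are hypothetical and are not taken from TimeSignature or the paper's code.

```python
import math

PERIOD_HOURS = 24.0  # assumption: one full day-night cycle


def zt_harmonic_features(zt_hours: float) -> tuple[float, float]:
    """Map a zeitgeber time in hours to (sin, cos) of its phase angle."""
    angle = 2.0 * math.pi * (zt_hours % PERIOD_HOURS) / PERIOD_HOURS
    return math.sin(angle), math.cos(angle)


def add_zt_predictors(biomarkers, zt_hours):
    """Append the two ZT-derived predictors to a biomarker vector."""
    sin_zt, cos_zt = zt_harmonic_features(zt_hours)
    return list(biomarkers) + [sin_zt, cos_zt]


if __name__ == "__main__":
    # Hypothetical sample: two biomarker values collected at ZT 6.
    print(add_zt_predictors([0.1, 0.2], 6.0))
```

Encoding ZT as a point on the unit circle keeps ZT 23.5 and ZT 0.5 close together, which a raw hour value would not.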

10
Multistage Machine Learning Reveals Circadian Gene Programs and Supports a Retina-Choroid Axis in Myopia Development

Watcharapalakorn, A.; Poyomtip, T.; Tawonkasiwattanakun, P.; Dewi, P. K. K.; Thomrongsuwannakij, T.; Mahawan, T.

2026-04-06 bioinformatics 10.64898/2026.04.02.716020 medRxiv
Top 0.4%
0.9%

Purpose: To determine whether circadian timing defines critical molecular windows in myopia development and to assess the transferability of circadian gene programs across ocular tissues, disease stages, and species. Methods: Publicly available retinal and choroidal RNA-seq datasets from chick models of form-deprivation myopia were analyzed using unsupervised transcriptomic profiling and multistage machine-learning classification. Circadian windows were defined based on Zeitgeber time, and samples were grouped accordingly for downstream analyses. Classification model robustness was evaluated through cross-tissue and cross-stage validation and further assessed using external validation in an independent dataset. Functional translation to humans was examined using ortholog-based Gene Ontology enrichment analysis to identify conserved biological processes and higher-order regulatory pathways. Results: A circadian critical window at ZT8-ZT12 exhibited the strongest transcriptional divergence during both myopia onset and progression. Gene signatures derived from this window generalized across retina and choroid and remained predictive across disease stages, supporting coordinated molecular regulation between ocular tissues. External validation confirmed the reproducibility of these signatures despite differences in experimental design and gene coverage. Functional mapping revealed that conserved molecular components in chicks are reorganized into more complex neuroendocrine and regulatory networks in humans, indicating cross-species conservation with increased functional complexity. Conclusions: Circadian timing strongly shapes myopia-related gene expression and underlies coordinated retina-choroid signaling. These findings highlight circadian biology as a key factor of refractive development and suggest that time-dependent mechanisms may influence myopia susceptibility, progression, and response to treatment.

11
lncRNA NORM is essential for proper chromosome segregation through the Plk1-Bub1 and Nsun2 axis.

Dongardive, V.; Jathar, S.; Srivastava, J.; Tripathi, V.

2026-03-16 cell biology 10.64898/2026.03.15.711899 medRxiv
Top 0.4%
0.9%

The cell cycle comprises different phases and is a tightly regulated process at the molecular level. During the cell cycle, two key events occur: DNA duplication during the S phase and chromosome segregation during mitosis. Accurate cell cycle progression, achieved through faithful chromosome segregation, is essential for maintaining cell fidelity. Long noncoding RNAs are a subclass of noncoding RNAs that are longer than 200 bp and form RNA-protein complexes (RNPs) to regulate various biological processes. Herein, we demonstrate that lncRNA NORM is involved in regulating the cell cycle by maintaining proper chromosome segregation. NORM exhibited G2 phase-specific expression, and the depletion of NORM resulted in a significant G2/M arrest. NORM-depleted cells failed to progress in mitosis and showed defects in chromosome segregation. We further demonstrated that NORM binds to proteins such as Plk1 and Nsun2. Depletion of NORM hindered the interaction between Plk1 and Bub1, resulting in reduced kinetochore localization of Plk1 during prometaphase. Our results also show that the depletion of NORM affects the binding of Nsun2 protein to CDK1 mRNA and, consequently, the stabilization of CDK1 at the protein level. Altogether, our results demonstrate that NORM regulates chromosome segregation by mediating the interaction between Plk1 and Bub1.

12
PlotGDP: an AI Agent for Bioinformatics Plotting

Luo, X.; Shi, Y.; Huang, H.; Wang, H.; Cao, W.; Zuo, Z.; Zhao, Q.; Zheng, Y.; Xie, Y.; Jiang, S.; Ren, J.

2026-02-03 bioinformatics 10.64898/2026.01.31.702995 medRxiv
Top 0.5%
0.9%

High-quality bioinformatics plotting is important for biology research, especially when preparing for publications. However, the long learning curve and complex coding environment configuration often appear as inevitable costs towards the creation of publication-ready plots. Here, we present PlotGDP (https://plotgdp.biogdp.com/), an AI agent-based web server for bioinformatics plotting. Built on large language models (LLMs), the intelligent plotting agent is designed to accommodate various types of bioinformatics plots, while offering easy usage with simple natural language commands from users. No coding experience or environment deployment is required, since all the user-uploaded data is processed by LLM-generated code on our remote high-performance server. Additionally, all plotting sessions are based on curated template scripts to minimize the risk of hallucinations from the LLM. Aided by PlotGDP, we hope to contribute to the global biology research community by constructing an online platform for fast and high-quality bioinformatics visualization. [Figure 1: Graphical abstract.]

13
Likelihood-Based Identification of Cell Division Mechanisms

Teichner, R.; Meir, R.; Brenner, N.

2026-01-20 cell biology 10.64898/2026.01.16.700002 medRxiv
Top 0.5%
0.8%

Cell size homeostasis in bacteria is a fundamental problem in systems biology, where cells maintain growth and division over many generations despite intrinsic fluctuations. Identifying the underlying control mechanism, whether division is triggered by reaching a critical size (sizer) or by adding a fixed size increment (adder), is essential for understanding this process. These two hypotheses are widely studied, yet there is no guarantee that either fully captures the true biological mechanism. More fundamentally, it has been unclear whether the control mechanism is statistically identifiable at all from lineage data. We address this question by developing a likelihood-based framework that explicitly accounts for threshold dynamics modeled as an Ornstein-Uhlenbeck process. Division timing is formulated as the first-passage-time (FPT) of this stochastic process to a time-dependent barrier. However, the FPT distribution lacks a closed-form analytical expression, preventing direct derivation of the maximum likelihood estimator (MLE). We overcome this challenge by training a neural network to approximate the FPT distribution and integrating it into the likelihood function, preserving analytical structure up to the FPT term. Simulations demonstrate that our method reliably distinguishes between sizer and adder mechanisms under realistic conditions where heuristic methods fail, providing the first evidence that the underlying control mechanism is identifiable. This hybrid analytical-machine learning approach provides a generalizable framework for studying stochastic threshold-based regulation in biological systems. Code reproducing the results is available at https://github.com/RonTeichner/newBacteria.
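
The difference between the sizer and adder rules can be seen in a deliberately simplified, noise-free sketch. It assumes perfectly symmetric division and fixed thresholds, with hypothetical parameter values; it is not the authors' stochastic model, which treats the threshold as an Ornstein-Uhlenbeck process.

```python
def next_birth_size(birth_size, rule, critical_size=2.0, added_size=1.0):
    """Birth size of a daughter cell under a noise-free division rule.

    Assumptions (illustrative only): symmetric division, fixed thresholds.
    """
    if rule == "sizer":
        division_size = critical_size            # divide at a fixed size
    elif rule == "adder":
        division_size = birth_size + added_size  # divide after fixed growth
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    return division_size / 2.0


def simulate_lineage(initial_birth_size, rule, generations):
    """Return the sequence of birth sizes along one lineage."""
    sizes = [initial_birth_size]
    for _ in range(generations):
        sizes.append(next_birth_size(sizes[-1], rule))
    return sizes


if __name__ == "__main__":
    print(simulate_lineage(3.0, "sizer", 3))
    print(simulate_lineage(3.0, "adder", 3))
```

Under these assumptions a sizer erases a birth-size deviation in one generation, while an adder halves it each generation. Both settle at the same steady-state size, which hints at why telling the two apart from noisy lineage data is a statistical question.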

14
Using user-centered design to better understand challenges faced during genetic analyses by novice genomic researchers

Patel, H.; Crosslin, D.; Jarvik, G. P.; Hall, T.; Veenstra, D.; Xie, S.

2026-02-10 bioinformatics 10.64898/2026.02.06.704411 medRxiv
Top 0.6%
0.8%

The lack of user-centered design principles in the current landscape of commonly used bioinformatics software tools poses challenges for novice genomics researchers (NGRs) entering the genomics ecosystem. Comparing the usability of one analysis software to that of another is a non-trivial task and requires evaluation criteria that incorporate perspectives from both existing literature and a diverse, underrepresented user base of NGRs. To better characterize these barriers, we utilized a two-pronged approach consisting of a literature review of existing bioinformatics tools and semi-structured interviews on the needs of NGRs. From both knowledge sources, the key attributes that resulted in poor adoption and sustained use of most bioinformatics tools included poor documentation, lack of readily accessible informational content, challenges with installation and dependency coordination, and inconsistent error messages/progress indicators. Combining the findings from the literature review and the insights gained by interviewing the NGRs, an evaluation rubric was created that can be utilized to grade existing and future bioinformatics tools. This rubric acts as a summary of key components needed for software tools to cater to the diverse needs of both NGRs and experienced users. Due to the rapidly evolving nature of genomics research, it becomes increasingly important to critically evaluate existing tools and develop new ones that will help build a strong foundation for future exploration.

15
Awareness of the Importance of Genetic Counseling and Its Role in Preventing Genetic Disorders in Derna District

Al-Ghazali, M. A.; AL-MAYAR, D. I.; AL-FKHAKHRI, H. O.; AL-HIJAZI, H. M.

2026-02-02 genetic and genomic medicine 10.64898/2026.01.27.25342786 medRxiv
Top 0.6%
0.8%

This study examined awareness, attitudes, and perceived barriers regarding genetic counseling among individuals in Derna District, focusing on its role in preventing genetic disorders. A descriptive cross-sectional design was employed, involving 278 participants aged 17 to 45 years, selected through stratified random sampling. Data were collected using structured questionnaires and analyzed with descriptive statistics via SPSS version 26.0. The findings revealed that while 65.5% of participants reported a high level of knowledge about genetic counseling, significant gaps remain, with 34.5% indicating low knowledge. Most participants demonstrated positive attitudes: 90.6% believed genetic counseling is important for preventing genetic disorders, and 90.3% expressed willingness to undergo counseling if recommended by a physician. However, perceived barriers such as fear of results (39.9%) and lack of awareness (30.9%) were reported. The study highlights the need for targeted educational initiatives and policy measures to promote genetic counseling services and address identified barriers. The findings provide valuable guidance for public health programs aiming to enhance the utilization of genetic counseling in the region.

16
PanACRpred: Predicting Accessible Chromatin Regions in Pangenomes using Motif Chaining

Warr, M. J.; Dinh, T.; Root, B.; Onstott, E.; Yu, K.; Mudge, J.; Ramaraj, T.; Kahanda, I.; Mumey, B.

2026-02-06 bioinformatics 10.64898/2026.02.05.703812 medRxiv
Top 0.6%
0.8%

In this work, we investigate using motif subsequence features to predict whether a genomic region is accessible to regulatory proteins, i.e. an accessible chromatin region (ACR), enabling transcription of associated genes. We focus on plants, whose agricultural and ecological importance make them interesting and important organisms to study, and whose complex genomes provide important stress tests for our algorithm. We show that motif sequence similarity as found by co-linear chaining can be used in combination with machine learning models to effectively predict ACRs in genome assemblies.
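
As a rough illustration of co-linear chaining, the sketch below picks the highest-scoring chain of anchors whose positions increase in both sequences. The anchor format `(pos_a, pos_b, score)`, the function name, and the absence of gap penalties are all assumptions; this is a generic quadratic-time textbook formulation, not the PanACRpred implementation.

```python
def best_colinear_chain(anchors):
    """Return (total_score, chain) for the best co-linear chain.

    anchors: list of (pos_a, pos_b, score) tuples with positive scores.
    A chain must be strictly increasing in both pos_a and pos_b.
    Simple O(n^2) dynamic programme without gap costs (illustrative).
    """
    order = sorted(anchors)
    n = len(order)
    if n == 0:
        return 0.0, []
    best = [a[2] for a in order]   # best chain score ending at each anchor
    prev = [-1] * n                # back-pointers for chain recovery
    for j in range(n):
        for i in range(j):
            if order[i][0] < order[j][0] and order[i][1] < order[j][1]:
                candidate = best[i] + order[j][2]
                if candidate > best[j]:
                    best[j] = candidate
                    prev[j] = i
    end = max(range(n), key=best.__getitem__)
    total = best[end]
    chain = []
    while end != -1:
        chain.append(order[end])
        end = prev[end]
    chain.reverse()
    return total, chain


if __name__ == "__main__":
    # Hypothetical motif matches between two regions.
    hits = [(1, 1, 2.0), (2, 5, 1.0), (3, 2, 2.0), (4, 4, 3.0)]
    print(best_colinear_chain(hits))
```

The resulting chain score could then serve as a similarity feature for a downstream classifier, in the spirit of the approach described above.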

17
The limits of Bayesian estimates of divergence times in measurably evolving populations

Ivanov, S.; Fosse, S.; dos Reis, M.; Duchene, S.

2026-03-03 bioinformatics 10.64898/2026.02.28.708707 medRxiv
Top 0.6%
0.8%

Bayesian inference of divergence times for extant species using molecular data is an unconventional statistical problem: Divergence times and molecular rates are confounded, and only their product, the molecular branch length, is statistically identifiable. This means we must use priors on times and rates to break the identifiability problem. As a consequence, there is a lower bound on the uncertainty that can be attained under infinite data for estimates of evolutionary timescales using the molecular clock. With infinite data (i.e., an infinite number of sites and loci in the alignment), uncertainty in the ages of nodes in phylogenies increases proportionally with their mean age, such that older nodes have higher uncertainty than younger nodes. On the other hand, if extinct taxa are present in the phylogeny, and if their sampling times are known (i.e., heterochronous data), then times and rates are identifiable and uncertainties of inferred times and rates go to zero with infinite data. However, in real heterochronous datasets (such as viruses and bacteria), alignments tend to be small, and how much uncertainty is present and how it can be reduced as a function of data size are questions that have not been explored. This is clearly important for our understanding of the tempo and mode of microbial evolution using the molecular clock. Here we conducted extensive simulation experiments and analyses of empirical data to develop the infinite-sites theory for heterochronous data. Contrary to expectations, we find that uncertainty in the ages of internal nodes scales positively with the distance to their closest tip with known age (i.e., calibration age), not their absolute age. Our results also demonstrate that estimation uncertainty decreases with calibration age more slowly in data sets with more, rather than fewer, site patterns, although overall uncertainty is lower in the former.
Our statistical framework establishes the minimum uncertainty that can be attained with perfect calibrations and sequence data that are effectively infinitely informative. Finally, we discuss the implications for viral sequence data sets. In a vast majority of cases viral data from outbreaks is not sufficiently informative to display infinite-sites behaviour and thus all estimates of evolutionary timescales will be associated with a degree of uncertainty that will depend on the size of the data set, its information content, and the complexity of the model. We anticipate that our framework is useful to determine such theoretical limits in empirical analyses of microbial outbreaks.

18
Directionality bias in T/A cloning

Dountcheva, V.; Bubulya, A.; Rouhana, L.

2026-02-12 molecular biology 10.64898/2026.02.11.705383 medRxiv
Top 0.6%
0.8%

T/A cloning is a popular method for generating recombinant DNA plasmids. This method relies on single A:T nucleotide base pairs between PCR product ends and vector. Theoretically, the directionality of insert ligation with relation to the vector is random. However, we have continuously observed directionality bias using the pGEM-T Vector System for T/A cloning in a Course-based Undergraduate Research Experience (CURE). Cloning of over 400 inserts has shown directional bias higher than 74% (p-value < 0.0005) "sense" to the T7 promoter of the vector. Awareness of biased insertion in our applications reduces time and cost in cloning and downstream analyses.
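
To show how a directional bias of this size compares with the random-orientation expectation, here is a sketch of a one-sided exact binomial test against a 50% null. The abstract does not state which test the authors used, and the counts below (296 of 400) are hypothetical values chosen only to be consistent with "over 400 inserts" and "higher than 74%".

```python
from math import comb


def binomial_upper_tail(successes, trials, p_null=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1.0 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical counts: 296 "sense" inserts out of 400 clones (74%).
    p_value = binomial_upper_tail(296, 400)
    print(p_value < 0.0005)
```

Under these assumed counts the tail probability falls far below the reported 0.0005 threshold, so a bias of this magnitude is very unlikely to arise from random ligation orientation alone.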

19
Clarified an rDNA Gene Unit Pattern with (CTTT)n and (CT)n Microsatellites Aggregation Ahead of and Behind the Gene in Human Genome

Shen, J.; Tang, S.; Xia, Y.; Qin, J.; Xu, H.; Tan, Z.

2026-03-24 genetics 10.64898/2026.03.22.713381 medRxiv
Top 0.6%
0.8%

Background: Conventional models of human ribosomal DNA (rDNA) array organization have historically depended on transcription-centric boundaries, partitioning the unit into a ~13 kb rDNA transcription region and a monolithic ~31 kb intergenic spacer (IGS). While our previous identification of Duplication Segment Units (DSUs) mapped these arrays based on an intuitive analysis of the microsatellite density landscape of the complete reference human genome, our present deep mining of this landscape has revealed a more accurate rDNA Gene Unit Pattern. Methods & Results: In this study, we conducted a deep mining analysis of our previously established microsatellite density landscape of the T2T-CHM13 assembly, focusing specifically on nucleolar organizing regions (NORs). We suggest a more accurate rDNA Gene Unit Pattern containing a (CTTT)n microsatellite aggregation ahead of the rDNA gene and a (CT)n microsatellite aggregation behind the gene, rather than a pattern featuring an IGS region inserted between two rDNA genes. Conclusions: A correct rDNA gene pattern of the human genome probably includes a (CTTT)n microsatellite aggregation ahead of the gene and a (CT)n microsatellite aggregation behind it, which possibly constitute cis- and trans-regulating regions; the (CTTT)n and (CT)n microsatellite aggregations may provide two different local stable DNA structures for regulatory protein binding.

20
The Stochastic System Identification Toolkit (SSIT) to model, fit, predict, and design experiments

Popinga, A. N.; Forman, J.; Svetlov, D.; Vo, H. D.; Munsky, B. E.

2026-03-08 bioinformatics 10.64898/2026.02.20.707039 medRxiv
Top 0.6%
0.8%

Biological data is prone to both intrinsic and extrinsic noise and variability between experimental replicates. That same stochasticity and heterogeneity can carry information about underlying biochemical mechanisms but, if not incorporated in modeling and probabilistic inference, can also bias parameter estimates and misguide predictions and, subsequently, experiment design. Mechanistic inference typically requires lengthy simulations (e.g., the Stochastic Simulation Algorithm (SSA)); approximations to chemical master equation (CME) solutions that lack rigorous error tracking; or deterministic averaging that lacks the complexity necessary to reflect the data. We introduce the Stochastic System Identification Toolkit (SSIT) - a fast, flexible, and open-source software package available on GitHub that makes use of MATLAB's efficient and diverse computational architecture. The SSIT is designed for building, simulating, and solving chemical reaction models using ODEs, moments, SSA, Finite State Projection truncations of the CME, or hybrid methods; sensitivity analysis and Fisher information quantification; parameter fitting using likelihood- or Bayesian-based methods; handling of experimental noise and measurement errors using probabilistic distortion operators; and sequential experiment design that empowers users to save time and resources while gaining the most information possible out of their data. The SSIT also offers advanced modeling tools, including model reduction methods for increased efficiency and joint fitting of models and datasets with overlapping reactions or parameters. For ease and speed of use, the SSIT provides a graphical user interface and ready-made, adaptable pipelines that can be run in the background from the command line or on high-performance computing clusters.
We demonstrate features of the SSIT on two experimental datasets: the first consists of published mRNA count data that reflect Saccharomyces cerevisiae yeast cell response to osmotic shock using single-cell single-molecule fluorescence in situ hybridization; the second consists of single-cell RNA sequencing measurements of 151 activating genes in breast cancer cells following treatment with dexamethasone. Author summary: We present the Stochastic System Identification Toolkit (SSIT) to model, fit, and predict any data that can be interpreted as changing populations or counts through time, including but not limited to single-cell experiments, economics, epidemiology, ecology, sociology, agriculture, and biotechnology. The SSIT was constructed particularly for stochastic modeling, which is important for systems whose states may experience significant fluctuations from mean behavior, thus affecting the inference of the underlying rate parameters and predictions of subsequent behavior. The SSIT provides statistical inference tools for parameter estimation; sensitivity analysis and information calculation; handling of distortions to probability distributions caused by experimental or measurement processes (e.g., dropout in single-cell RNA sequence data and total fluorescence intensities versus spot counting/puncta analysis); and quantitative design of experiments. The SSIT also offers a variety of complex modeling tools, including model reduction methods and fitting of combined models/datasets that share some behavior but remain distinct (e.g., different genes responding to a single stimulus). The SSIT generates pipelines for easy, efficient analyses to run in the MATLAB environment, in the background from the command line, or on high-performance computing clusters, helping users make informed, time- and cost-effective decisions about their next set of experiments.
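For readers unfamiliar with the Stochastic Simulation Algorithm named in the abstract, the following is a minimal Python sketch of a Gillespie simulation for a single-species birth-death model (constitutive mRNA production and first-order degradation). It is not SSIT code and does not use the SSIT API, which is a MATLAB package; the function name `gillespie_birth_death`, the model, and the rate values are assumptions chosen purely for illustration.

```python
import random

def gillespie_birth_death(k_prod, k_deg, t_end, x0=0, seed=None):
    """Simulate one trajectory of:  0 -> mRNA (rate k_prod),
    mRNA -> 0 (rate k_deg * x), using the Gillespie direct method."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_prod = k_prod          # propensity of production
        a_deg = k_deg * x        # propensity of degradation
        a_total = a_prod + a_deg
        if a_total == 0.0:
            break                # no reaction can fire
        # Waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_prod:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts

# Illustrative (assumed) rates: production 10 per unit time, degradation 1 per molecule.
times, counts = gillespie_birth_death(k_prod=10.0, k_deg=1.0, t_end=50.0, seed=1)
print(f"{len(times) - 1} reactions simulated; final count = {counts[-1]}")
```

Each trajectory is one random sample, so estimating a full distribution this way requires many repeated runs, which is the cost the abstract refers to as "lengthy simulations".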